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Crimeware-as-a-service (CaaS) presents a growing cybersecurity threat by 
facilitating the acquisition of tools for website phishing attacks. Detecting 
these attacks requires effective techniques to obtain accurate results in real 
time. Cloud machine learning (CML) emerged as a promising solution with 
the powerful tools of amazon web services (AWS). This study proposes a 


novel approach combining cyber kill chain concept with AWS technologies 

to enhance website phishing detection, using AWS SageMaker to preprocess 
Keywords: an 11,430 uniform resource locators (URL) dataset and train 3 algorithms, 
AWS which are: decision tree (DT), random forest (RF), and support vector machine 
: . (SVM), evaluate their performance through batch transform, and deploy them 
Crimeware-as-a-service as separate endpoints. Prediction functions are then conducted on each 
Cyber kill chain endpoint and compared to batch transform results. Our findings demonstrate 
Decision tree that the combination of the cyber kill chain concept and AWS CML 
Random forest significantly enhances website phishing detection by achieving results of 97% 
Support vector machine for RF in 0.48 seconds, 94% for SVM in 0.94 seconds, and 93% for DT in 


Website phishing 0.52 seconds. By leveraging CML algorithms and breaking down attacks into 
stages, our approach identifies and disrupts attacks earlier, preventing 
damage. This research highlights the value of our approach in improving 
cybersecurity and protecting against website phishing attacks. 
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1. INTRODUCTION 

The increasing digitization of the world led to a rise in cybercrime, particularly through particularly 
through crimeware-as-a-service (CaaS), where criminal organizations provide illicit services and tools for 
online attacks like phishing, malware distribution, and data theft [1]. CaaS operates similarly to legitimate 
software-as-a-service (SaaS) providers but is illegal and used for criminal activities [2]. Phishing attacks, 
facilitated by CaaS, are social engineering attacks that target individuals and organizations for financial gain 
or system harm [3], [4]. These attacks involve tricking victims into providing confidential information or 
downloading malicious entities by copying or modifying legitimate websites. Phishing attacks are becoming 
more sophisticated with the use of advanced technologies, making them harder to detect [5]. 

Cyber kill chain, a concept that has been developed by Lockheed Martin, describes the stages of a 
cyberattack and is widely adopted in cybersecurity. Understanding and disrupting these stages is crucial for 
organizations to defend against cyber threats [6]. Cloud machine learning (CML) integrates machine learning 
with cloud computing, providing scalable platforms for detecting and analyzing phishing attacks [7]. Machine 
learning as a service (MLaaS) is an essential part of website phishing detection, offering tools for efficient 
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execution of machine learning models without considering the underlying service structure. CML enables 
quick and accurate analysis of large data volumes to identify attack patterns and anomalies [8]. 

Amazon web services (AWS) is a cloud-based platform that offers a scalable environment for creating 
machine learning models. It combines virtual hardware and software resources, including $3, SageMaker, 
cloudwatch, and EC2, to process large amounts of data efficiently [9]. AWS provides developers with full 
control over their working environment and is designed for computationally intensive workloads [10]. AWS 
SageMaker and cloudwatch are integral components for developing and monitoring machine learning models 
[11], [12]. Previous studies have addressed website phishing detection using machine learning models, but 
faced challenges with real-time processing and limited incorporation of cloud computing resources or the cyber 
kill chain concept [13]—-[22]. To overcome these limitations, new techniques have emerged that integrate cloud 
computing and the cyber kill chain concept, improving detection accuracy and speed. 

This paper presents a method that utilizes CML and the cyber kill chain concept for enhanced website 
phishing detection. The approach involves preprocessing a dataset of uniform resource locators (URL) using 
AWS SageMaker and S3, training three machine learning algorithms, evaluating performance with SageMaker 
batch transform, deploying algorithms as separate endpoints with SageMaker endpoints, and conducting 
prediction functions. Results from endpoint prediction are compared to those from batch transform, and time 
measurements are obtained using AWS cloudwatch logs. AWS EC2 provides virtual resources for the process, 
aiming to improve accuracy and speed of website phishing detection by leveraging AWS machine learning 
services and the cyber kill chain concept. 


2. METHOD 

This study presents a cloud-based ML model that aims to enhance the accuracy and speed of website 
phishing detection by combining the steps of the cyber kill chain concept and AWS. Figure | represents the 
framework of the suggested CML model in the case of an attack. Figure 2 represents the steps of the proposed 
model which consists of seven stages. 

In the first stage, the main focus is to align the reconnaissance step of the cyber kill chain concept 
with the data collection stage. This alignment ensures a systematic and methodical approach to gathering the 
required dataset. During this stage, the system first acquires the dataset from the designated source, whether it 
be from internal records or external data providers. Subsequently, a comprehensive analysis of the obtained 
data is conducted to understand its structure, quality, and relevance to the project's objectives. Once the dataset 
is thoroughly analyzed and validated, it is then uploaded to the AWS S3 bucket. 

The second stage involves data preprocessing, which adapts the weaponization step of cyber kill chain 
model. This stage preprocesses the dataset by performing six steps, such as feature reduction, feature selection, 
standardization, label encoding, dataset splitting, and feature scaling. These preprocessing steps are performed 
using SageMaker studio notebook and the results are stored in the S3 bucket while the preprocessing resources 
are obtained from AWS EC2 instance (ml.m4.xlarge) which provides four virtual CPUs, 16GB of memory, 
and high network performance for 32/64-bit operating systems. 

In the third stage, the delivery step of the cyber kill chain concept is adapted to the training stage of 
the proposed model, where three ML algorithms (decision tree (DT), random forest (RF), and support vector 
machine (SVM)) are trained using SageMaker studio python scripts. The results of the training are stored in 
the S3 bucket, the time of training is recorded in AWS cloudwatch, and the resources required for training are 
obtained from the EC2 instance (ml.m4.xlarge). In the fourth stage, the exploitation step of the cyber kill chain 
concept is adapted to the batch transform stage of the proposed model, which uses the training results to obtain 
evaluation results for each algorithm to measure model performance and accuracy, batch transform time for 
each algorithm is calculated in cloudwatch and the required resources are obtained from the EC2 instance 
(ml.m4.xlarge). 

The fifth stage the fifth step involves adapting the installation process of the cyber kill chain concept. 
This adaptation enables the deployment of each algorithm using three separate single-model endpoints, which 
enhances the model's scalability and efficiency. Furthermore, during the deployment, the system records the 
deployment time in CloudWatch, allowing for monitoring and analysis of performance metrics. To support the 
deployment process, the system acquires deployment resources from an EC2 instance with the specifications 
(ml.m4.xlarge). 

In the sixth stage, the primary objective of the sixth stage is to adapt the command-and-control step 
of the cyber kill chain concept to the prediction stage of the proposed model. This adaptation is crucial for 
establishing a seamless and efficient prediction process. In this stage, the system executes a prediction function 
on each endpoint, enabling the model to generate accurate and reliable final prediction results. 

The final stage of the proposed model adapts the action on objectives step of the cyber kill chain 
concept, which involves obtaining the prediction results and comparing them to the results of batch transform. 
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The time required for prediction is recorded in cloudwatch and compared to the time of the batch transform as 
well. By leveraging AWS services and the cyber kill chain concept, the proposed model aims to enhance the 


speed and accuracy of website phishing detection. 
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Figure 1. The framework of the proposed model in case of an attack 
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Figure 2. The steps of the suggested model 


2.1. Dataset collection and description 

In the dataset description stage, the first step of cyber kill chain, reconnaissance, is combined and 
adapted. This step is where the attacker gathers information about the target through passive reconnaissance or 
active reconnaissance. In our proposed model, the dataset that has been utilized in the present study, as 
described in Table 1, is named “web page phishing detection.” It is downloaded from Mendeley data, 
encompasses 11,430 URLs, and was created in May 2020. The dataset contains 89 features, with 87 being 
numeric, one being a nominal feature representing the dataset’s index in URLs, and one categorical feature 
representing the labels. The 87 retrieved features were divided by the study to 3 groups: 56 features generated 
from URL syntax, 24 features derived from the website content, and 7 features derived from website services. 


Table 1. Description of dataset used in the proposed model 


Characteristic Value 
Name Webpage phishing detection 
Source Mendeley data 
Type Binary/Supervised 
No. of samples 11430 
No. of extracted features 87 
Types of features Numeric, nominal, and categorical 
Missing data No 
Balanced Yes 


2.2. Dataset preprocessing 

In this stage, the concept of the weaponization step of the cyber kill chain is adopted. At this stage, 
the attacker creates a weapon to exploit the vulnerabilities and weaknesses identified during the reconnaissance 
stage. In this study, based on the analysis of the dataset, we reached the conclusion that the dataset is balanced, 
containing 50% legitimate URLs (5,715 samples) and 50% phishing URLs (5,715 samples) with no missing 
values, in that way simplifying the preprocessing step into the following six procedures: 

First, feature selection was conducted using Pearson correlation to eliminate highly correlated 
features, reducing redundancy and potential negative impacts on model performance. Three features with a 
correlation ratio above 90% were removed. Next, feature reduction was applied by removing the nominal 
feature “URL,” which represented the names of the URLs in the dataset, aiming to enhance model accuracy by 
reducing noise. Categorical label values were encoded into numeric values through label encoding, ensuring 
compatibility with machine learning algorithms. The “status” feature representing dataset classes was encoded, 
with “legitimate” converted to 0 and “phishing” converted to 1. Standardization using python’s standardScaler 
Q was employed to transform features, giving them an average value of 0 and a standard deviation of 1. This 
step was crucial for algorithms like SVM that require standardized data. Standardization was performed on 
features with values higher than 100. 
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Additionally, feature scaling was performed using python’s MinMaxScaler () to rescale data within a 
fixed range, typically between 0 and 1. This normalization prevented the dominance of certain features over 
others and was applied to features with values between 10 and 100. The data-set has been split to train and test 
datasets using 70:30 ratio, with 70% (8,001 samples) assigned to the train dataset and 30% (3,425 samples) to 
the test dataset. Finally, the datasets were uploaded to S3 for further processing, ensuring easy access and 
storage for the proposed model. 


2.3. Training algorithms 

This stage will be matched to the delivery step of the cyber kill chain concept. This step involves the 
delivery of the weapon to the target by the attacker through various methods. This study will use three 
algorithms to train the model and evaluate its performance. These algorithms are DT, RF, and SVM. These 
algorithms will be embedded into the SageMaker notebook using separate python scripts for each algorithm. 


2.3.1. Decision tree 

Machine learning algorithms like DT could be applied to regression and classification problems. It 
creates a model that resembles a tree, with internal nodes standing in for the data-set’s features, branches for 
decision-making processes depending on the feature values, and leaves for the class labels. The feature that 
maximizes the information gain or minimizes the impurity measure is used by the method to recursively divide 
the data into smaller subgroups. DT models can manage both numerical and categorical data, are easily 
interpretable and visualized, and can handle high-dimensional datasets. These are just a few benefits of using 
DT models. They are also simple and easy to understand, validate, and interpret. Given a set of training samples 
{(%1, V1), (2) V2), +» Xn» Vn) } Where x; is the feature vector and y; is the corresponding class label, a DT is 
constructed recursively using the following steps: choose the most suitable feature to divide data based on a 
criterion like Gini index or information gain. Based on the selected feature, divide the data into two subsets. 
Once a stopping requirement, like maximum depth or minimal number of samples per leaf, has been met, 
recursively repeat steps | and 2 on each subset. Assign the predicted class label to the sample that has the 
highest percentage of that class within each leaf node. Depending on a measurement of how well the feature 
divides the classes, the best feature to split the data is chosen. Information gain, which is a frequently used 
metric and is defined as: 


IG(D, f) = HM) — H(DIf) (1) 


where D is the set of training samples, fis a feature, H(D) is the entropy of D, and H(D|f) is the conditional 
entropy of D given f. Entropy is a measure of impurity or uncertainty, and is defined as: 


H(D) = —X(p@ * log, pW) (2) 


where p(i) is the proportion of samples in class i. The higher the entropy, the more uncertain the class 
distribution. The conditional entropy is defined as: 


H(DIf) = YC! « Hw) (3) 


|D| 


where D(j) represent the subset of D for which feature f has value j. The lower the conditional entropy, the 
better the feature separates the classes [23]. 


2.3.2. Random forest 

The RF is another popular algorithm which can be utilized for the tasks of classification and 
regression. It’s a collection of a number of the trees that are trained independently on selected training data- 
sets. The information of the classification is determined then through voting amongst all trained trees. therefore, 
RF typically achieves a better accuracy of classification in comparison with a single tree. The algorithm is 
referred to as the random forest due to the way that the randomness is injected into tree-building process for 
ensuring that every tree differs. Considering a set of the training samples {(x,, y), (2, V2), ++» Xn» Yn) }, where 
x; is the feature vector and y; is the corresponding class label, RF algorithm constructs a model which predicts 
binary class label of a new input vector x as follows: randomly choose a sub-set of the features from the original 
set of feature: Let F be the total number of the features and k be the number of the features to select at every 
split. Selecting k features may be randomly carried out with the use of a uniform distribution, and k value can 
be set as function of F, usually, VF, construct a decision tree with the use of a sub-set of the training samples, 
which are chosen by either bootstrap sampling or sub-sampling with no replacement: Let S' be the total number 
of the training samples and m be the number of the samples to select for every tree. The sampling may be 
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carried out either with the replacement (i.e., bootstrap sampling) or with no replacements (i.e., sub-sampling). 
The value of m may be set as function of S, usually, S/3, repeat step 1 and step 2 for a fixed number of times 
or to the point where a stopping criterion is met: Let T be total number of the trees to construct. T value can be 
set through the cross-validation or grid search, and aggregate predictions of all of the DTs through either 
weighted voting or majority voting: Let T;(x) be the binary prediction of the tree i for input vector x. In the 
majority voting, the final prediction can be expressed by: 


h(x) = argmax (1 <j< C) YT; (x) =j (4) 


where C represent the number of the classes, and argmax represent the function returning the index of 
maximum element in the vector. In weighted voting, each tree is assigned a weight based on its performance, 
and the final prediction is given by: 


h(x) = argmax (1 < j < C)Y(W,T;(x~) =s/) (5) 


where w; is the weight of tree i, and the weights can be calculated based on the accuracy or the error of each 
tree [24]. 


2.3.3. Support vector machine 

Popular algorithm SVM offers a quick and effective implementation. SVM creates a hyperplane, a 2D 
line dividing the classes, throughout the training phase. In order to identify whether a webpage is being utilized 
for phishing, SVM analyzes a variety of factors, like the presence or lack of specific characters. The most 
popular kernel for classification, especially in datasets with a lot of features, is the linear kernel. Because of its 
success in enhancing dataset performance and performance assessment, this kernel is extensively utilized. In 
machine learning, binary classification problems involve classifying data into two categories, which can be 
represented as y; € {1,—/} Wx;. The training data-set contains all x; samples as well as their corresponding y; 
values that indicate the class to which data belongs. The SVM algorithm attempts to find an optimal hyperplane 
that can divide data to two regions, with all positive class samples in one region and negative class samples in 
the other. Predictions for new data can be made through evaluating its position on the graph and returning a 
positive or negative value using a sign function. None-the-less, it is seldom possible to linearly solve binary 
classification problems, as a d-dimensional graph cannot be separated by a d -1-dimensional hyperplane. 
Therefore, kernel functions are utilized in order to map data to a higher dimension. In the event that the data 
remains inseparable even after using a kernel function, the SVM algorithm aims to minimize the error rate. 
Given a vector X of input vectors x; € R, for i=/ to n, where p is the number of input features, and a vector Y 
€ {—1, 1}n of classes y;, where n is the number of input vectors, SVM algorithm solves the following problem: 


minyg2=|lvll? + CD12 (6) 
subject to, 

hiv?x, +b) > 1- zi =1,...,n (7) 

HS OT con (8) 


where the hyperplane is written as h = v7 X + q, v is the vector and gq are the scalar quantity. The parameter 
C is utilized to regulate the tradeoff between bias and variance of the model by penalizing misclassifications. 
Lower values of C result in higher bias, while larger values increase variance. To avoid overfitting of the 
dataset, positive slack variables 2; are introduced for regularization. Through forming the Lagrangian equation 
of the problem and the substitution of values with the use of Karush-Kuhn-Tucker conditions, the dual problem 
can be obtained, which can be formulated as follows: 


1 
MAX.W (a) = Yiar Mi — 5 Lier Lj=1 Vj MiOYK (Xi, X}) (9) 
subject to, 
0 <0: 2-66 HT cot (10) 
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where is a vector that consists of all Lagrangian multipliers a;, K (xi, x;) = 0(x;)" O(x;) is the kernel 


function of feature mapping @ that maps from input x to its features in higher dimension @(x). Similarly, we 
can express the dual in vector form as: 


mings a" Qa — eTa (12) 
subject to, 
OS G6 SS Tot (13) 


where Q;; = yy, K (x;x;) and e is a vector of ones. Radial basis function (RBF) kernel has been utilized as 
kernel function for mapping data to higher dimensions. Because of its accuracy and dependable performance, 
RBF kernel function has gained popularity among many researchers in various domains. RBF kernel function 
can be described as follows: 


K(x;x;) = exp (—y||x; — %;|17) (14) 


which is why, for constructing SVM model, it takes in 2 parameters, which are: C and y that can be optimized 
with the use of the optimization algorithms [25]. 


2.4. SageMaker scripts 

Through simply calling the algorithm inside SageMaker notebook, AWS SageMaker has been known 
to utilize the amazon machine learning (AML) algorithm library to give highly accurate built-in algorithms for 
the development and training of the scalable and effective cloud ML models. Not all ML techniques, 
nevertheless, are supported by the SageMaker environment. For building the models, the developers 
occasionally need to incorporate adjustable algorithms. In this situation, algorithm scripts offer a more 
adaptable way to execute algorithms with greater control over input/output format and hyperparameters, as 
well as the capability for containing data loading, training, pre-processing, and saving steps all in one script. 
This enables developers to improve the performance of their models and modify the algorithms to suit the 
particular requirements of their use cases, 3 scripts will be used in this work to activate three different ML 
algorithms for training and deploying models: RF, DT, and SVM. The training results will be put on S3 and 
later retrieved for the prediction stage. 


3. RESULTS AND DISCUSSION 

After training the model using the three algorithms mentioned previously, the training artifacts of each 
algorithm are uploaded to the S3 bucket and a batch transform procedure will be conducted on these artifacts 
to obtain the evaluation results of each algorithm. Next, endpoints will be created to deploy the model. Then, 
a prediction function will be conducted on each endpoint to obtain the final results, Finally, a comparison of 
results and times required to obtain these results with those of batch transform will be applied to determine the 
best accuracy and the highest speed of the acquired results. 


3.1. Evaluation using batch transform 

In this study, batch transform will be adapted to the exploitation step of the cyber kill chain concept. 
At this stage, the attacker exploits the vulnerability using the weapon that was delivered. Figure 3 represents 
the steps of batch transform of the proposed model. Through breaking the process down into records or mini- 
batches, the batch transform SageMaker feature may be used to create predictions from a trained model on 
large data-sets. As shown in the mentioned figure, it is appropriate for the one-time processing, provides 
insightful information about the status and effectiveness of the transform task, and stores the generated 
predictions in $3. Batch transform is more economical and flexible than real-time inference, but not being 
optimal for making predictions in real-time. It supports a broad variety of frameworks and models, could handle 
massive volumes of input data and scale to handle more data, and gives users the opportunity to customize 
input and output data formats. To analyze data gathered over a long period of time or to execute large-scale 
data analysis without having to worry about latency or online deployment, batch transform is often employed 
offline. Soltys [26] evaluation results are acquired with the use of 8 evaluation metrics: precision, accuracy, F- 
1, recall, receiver operating characteristic (ROC), negative prevalence, specificity, and mislabeling. Batch 
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transform is then performed on the training outcomes of each method using the test set. As a result of achieving 
the greatest accuracy score, RF surpassed both DT and SVM, according to the data. 

Figure 4 show the evaluation metrics used for the proposed model. The results show that RF had the 
highest accuracy score of 97% with high precision, F-1, recall, specificity, and negative prevalence, this 
indicates that the algorithm was able to predict the labels of most samples, correctly identify most positive 
samples with a low false alarm rate which leads to a high positive rate and low negative rate. RF also has the 
lowest mislabeling score of 0.03%, indicating that the dataset preprocessing is sufficient to process and the 
algorithm was able to learn from it effectively. RF classifier's excellent accuracy results could be ascribed to a 
number of variables. It starts by employing ensemble learning, which uses a group of DTs to lessen overfitting. 
Second, it uses randomness for reducing variance and noise when choosing features and data, which makes the 
model less sensitive to small changes in data. Also, it employs bagging methods to provide numerous data sets 
for each DT, reducing bias and enhancing generalization. Last but not least, the RF classifier lowers mislabeling 
by combining the output of various DT via a voting method. Those elements work together to give the RF 
classifier’s excellent accuracy performance. 

Secondly, SVM scored 94% accuracy with a 95% precision score, and similar F-1 and ROC scores. 
SVM scored 94% recall, 96% negative prevalence, and 0.05 mislabeling score, the reason behind these results 
returns to the nature of SVM binary classification algorithm in maximizing margin between the classes instead 
of the use of probabilistic models which makes it robust to outliers and make the algorithm focusing on the 
classification of the positive samples correctly even if that means classifying a few negative samples 
incorrectly. Also, the use of SVM of the weighted cost function is one of the reasons the algorithm has a high 
negative prevalence value. 

Additionally, DT achieved a 93% accuracy rating with a mislabeling score of 0.07% and comparable 
high precision, recall, F-1, negative prevalence, and specificity. Numerous variables contribute to DT 
classifier’s excellent accuracy. First of all, DT are simple to interpret and comprehend. Second, DT are resistant 
to outliers since they do not make any assumptions regarding the distribution of data. In addition, DT employ 
feature selection methods to lessen data noise. To record complex decision boundaries, the feature space is 
recursively split into binary splits. Last but not least, DT use trimming strategies to prevent overfitting. Those 
elements work together to give the DT classifier’s excellent accuracy performance. 
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Figure 3. Steps of the batch transform procedure of the proposed model 
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Figure 4. Comparison of evaluation metrics of the proposed method 


3.2. Deployment the model using endpoints 

This study will adapt the Installation step of the cyber kill chain concept to the endpoint deployment 
stage, installation involves installing the malicious entity on the target’s system by the attacker, while the action 
step involves. Endpoints are one of the key components of amazon SageMaker, which provides a way to deploy 
trained models and make predictions in real time. Endpoints are a critical aspect of building scalable models 
for use in real-world applications, such as phishing detection, fraud detection, and natural language processing. 
Figure 5 displays the steps of deploying the endpoint for the proposed model. After training the model using 
algorithms, a model artifact is created that contains the parameters and additional resources required by the 
model. Upon deployment, endpoints provide a URL that can be used to send new data to the model and return 
predictions based upon sent data. It is crucial to monitor and maintain endpoints to ensure the accuracy and 
effectiveness of the prediction. Endpoints have several advantages, including scalability to handle large 
datasets and high volumes of traffic, and the ability to make real-time predictions on new data that the model 
has not trained on. Endpoints can also avoid latency, even on large amounts of data. To create an endpoint, a 
model needs to be created which contains the container where the training results are stored on Amazon S3 
and the model artifacts needed to deploy the model. Then, the endpoint configuration is created which contains 
the destination path of the captured data and the resources needed to deploy the endpoint. Finally, the endpoint 
is created, and the time to create the endpoint is dependent upon several factors like the size of the data and the 
number of models it contains. In this work, three distinct, single-model endpoints will be utilized to deploy 
each of the algorithms utilized for training the suggested model. On each endpoint, a prediction function will 
be run to determine the prediction outcomes and time for each method. 
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Figure 5. The steps of endpoint deployment of the proposed model 
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3.3. Comparison of time 

By leveraging Amazon Cloud Watch logs, the training, batch transform, and prediction times for each 
algorithm have successfully computed. This data-driven approach forms the foundation of a robust 
performance comparison between the algorithms. Through this comprehensive analysis, valuable insights into 
the efficiency of each algorithm are gained, facilitating an informed decision-making process and ensuring an 
agile, and high-performing model for website phishing detection. 


3.3.1. Training time 

Figure 6 displays the training times for each one of the algorithms following recording the times for 
each algorithm. As a result of not requiring the optimization of intricate mathematical operations, the results 
indicated that DT recorded 153 seconds of training time. Additionally, the way where the DT splits the data 
might result in a decrease in overfitting and an improvement in generalization performance. The reason for this 
is that RF has a parallelization feature that allows it to handle large datasets through using parallel computing 
model and virtual multi-core processors that are offered by Amazon EC-2. 

Additionally, RF utilizes a subset sampling method for sub-sampling the training data, which 
decreases the amount of data that is processed at every node and, as a result, shortens the training time. The 
SVM achieved the quickest training time of 155.3 seconds due to its capacity to learn complex decision 
boundaries without explicitly computing the modified data. Additionally, SVM can optimize data effectively 
thanks to convex optimization when used with SVM. Additionally, compared to other algorithms, SVM’s 
nature makes it easier to train and requires less time for training data because there are less parameters to adjust. 
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Figure 6. Training time of the algorithms used in the proposed model 


3.3.2. Batch transform time 

The results indicate that RF had the shortest batch transform time, with 300 seconds as shown in 
Figure 7, whereas SVM and DT had slightly slower processing times of 310 and 320 seconds, respectively. 
These batch transform times were obtained for the 3 algorithms that have been utilized in the suggested model 
from the cloudwatch logs. Those findings suggest that while batch transform could process extremely large 
datasets with high accuracy and is more cost-effective in comparison with the real-time inference because users 
just pay for the time spent processing the input data instead of maintaining a persistent endpoint, batch 
transform setup could be time-consuming because it requires the creation and configuration of numerous 
resources and batch transform jobs. Which is why, it cannot be used to make predictions in real time. 
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Figure 7. Batch transform time of the algorithms used in the proposed model 
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3.3.3. Endpoint prediction time 

In this study, the last two steps of the cyber kill chain, namely command and control (C&C) and action 
on objectives, will be adapted to the last two stages of the proposed model. C&C involves establishing a 
command-and-control channel to communicate with the malicious entity and control the system, while in the 
action step the attacker achieves their objectives. After deploying the model using 3 separate single-model 
endpoints and conducting the prediction function on every endpoint, the AWS cloudwatch logs monitored the 
performance and recorded the time required for the prediction and evaluation. While the evaluation results 
remained unchanged after the endpoint deployment, the prediction time has decreased to a near real-time point 


as displayed in Figure 8. 
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Figure 8. Endpoint prediction time for the algorithms used in the proposed model 


The results showed that RF had the shortest prediction time of 0.48 seconds, DT recorded a prediction 
time of 0.25 seconds, while SVM recorded a prediction time of 0.49 seconds. The main reason why endpoint 
prediction time is faster than batch transform is that endpoint prediction requires much fewer computational 
resources and can be done quickly because it has all the necessary virtually unlimited resources and because 
the machine learning model is already trained, and the prediction can be generated quickly by passing the input 
data through the model’s pre-trained weights. Furthermore, endpoint prediction is often performed on dedicated 
hardware, which is optimized for fast computation, further reducing the prediction time. Figure 9 show the 
comparison of training, batch transform, and endpoint prediction times of the proposed model. 
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Figure 9. Comparison between training, batch transform, and endpoint prediction times of the proposed 
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4. CONCLUSION 

The cyber kill chain concept is combined with machine learning methods in this study to propose a 
novel approach for increasing the accuracy and speed of website phishing detection. To do this, the authors 
used SageMaker and S3 to preprocess a dataset of 11,430 URLs through AWS. With the use of SageMaker 
batch transform, three machine learning algorithms-DTs, RFs, and SVMs-were evaluated, trained, and 
deployed as individual endpoints. The times needed for such predictions have been derived with the use of 
AWS cloudwatch logs, and endpoint predictions have been compared to those produced with the use of batch 
transform. AWS EC2 delivered virtual resources in the form of the (ml.m4.xlarge) instance. The results 
specified that the RF algorithm had the shortest batch transform and endpoint prediction speeds, coming in at 
300 and 0.48 seconds, respectively, and had the highest accuracy score of 97%. With a short batch transform 
and endpoint prediction time of 310 seconds and 0.49 seconds, respectively, SVM achieved an accuracy score 
of 94%. With an endpoint prediction time of 0.52 seconds, a batch transform time of 320 seconds, and training 
time of 153 seconds, DT had a 93% accuracy rating. The authors came to the conclusion that their method, 
which makes use of AWS machine learning capabilities and the cyber kill chain idea, significantly increases 
website phishing detection accuracy and speed when processing large datasets. 
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